11.2 Markov Chains
make from a long, unbroken observation. The problem becomes particularly acute when evidence for higher-order Markov chains is sought, since the quantity of data required may be unattainable.⁸ An important result is Whittle's formula giving the distribution of the transition count:
$$
N^{(n)}_{uv}(F) \;=\; \frac{\prod_i f_{i\cdot}!}{\prod_{i,j} f_{ij}!}\, F^{*}_{uv},
\tag{11.9}
$$
where $N^{(n)}_{uv}(F)$ is the number of sequences $(a_1, a_2, \ldots, a_{n+1})$ having transition count $F = \{f_{ij}\}$ and satisfying $a_1 = u$ and $a_{n+1} = v$. The transition count, together with the initial state (with probability $p_{a_1}$), forms a sufficient statistic for the process, since

$$
p_{a_1} p_{a_1 a_2} \cdots p_{a_n a_{n+1}} \;=\; p_{a_1} \prod_{ij} p_{ij}^{f_{ij}},
\tag{11.10}
$$

where the left-hand side is simply the probability of realizing a particular sequence $(a_1, a_2, \ldots, a_{n+1})$. For $i, j = 1, \ldots, s$, $f_{ij}$ is the number of $m$, with $1 \le m \le n$, for which $a_m = i$ and $a_{m+1} = j$; $F$ is therefore an $s \times s$ matrix such that $\sum_{ij} f_{ij} = n$ and such that $f_{i\cdot} - f_{\cdot i} = \delta_{iu} - \delta_{iv}$, $i = 1, \ldots, s$, for some pair $u, v$, where $f_{i\cdot} = \sum_j f_{ij}$, and $\{f_{i\cdot}\}$ and $\{f_{\cdot j}\}$ are the frequency counts of $\{a_1, \ldots, a_n\}$ and $\{a_2, \ldots, a_{n+1}\}$, respectively, from which $f_{i\cdot} - f_{\cdot i} = \delta_{i a_1} - \delta_{i a_{n+1}}$. In Eq. (11.9), $F^{*}_{uv}$ is the $(v, u)$th cofactor of the matrix $F^{*} = \{f^{*}_{ij}\}$, with components

$$
f^{*}_{ij} =
\begin{cases}
\delta_{ij} - f_{ij}/f_{i\cdot} & \text{if } f_{i\cdot} > 0, \\
\delta_{ij} & \text{if } f_{i\cdot} = 0 .
\end{cases}
\tag{11.11}
$$
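As a concrete check on Eq. (11.9), here is a minimal sketch (Python; not from the text) that enumerates every sequence $(a_1, \ldots, a_{n+1})$ over a small alphabet, tallies how many sequences share each transition count and each pair of endpoint states, and compares the tallies with the multinomial-times-cofactor expression above. The function names (`f_star`, `whittle_count`, `brute_force_counts`) and the choice $s = 3$, $n = 6$ are illustrative assumptions.

```python
# Minimal sketch (assumed, not from the text): brute-force verification of
# Whittle's formula, Eq. (11.9), using the matrix F* of Eq. (11.11).
# States are labelled 0, ..., s-1 here instead of 1, ..., s.
from itertools import product
from math import factorial
from fractions import Fraction


def f_star(F):
    """Matrix F* of Eq. (11.11): delta_ij - f_ij/f_i. (row left as delta_ij if f_i. = 0)."""
    s = len(F)
    return [[Fraction(int(i == j)) - (Fraction(F[i][j], sum(F[i])) if sum(F[i]) else 0)
             for j in range(s)] for i in range(s)]


def det(M):
    """Exact determinant by cofactor expansion (fine for the tiny matrices used here)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))


def whittle_count(F, u, v):
    """Right-hand side of Eq. (11.9): (prod_i f_i.!)/(prod_ij f_ij!) times the (v,u)th cofactor of F*."""
    s = len(F)
    top = 1
    for i in range(s):
        top *= factorial(sum(F[i]))
    bot = 1
    for i in range(s):
        for j in range(s):
            bot *= factorial(F[i][j])
    Fs = f_star(F)
    minor = [[Fs[i][j] for j in range(s) if j != u] for i in range(s) if i != v]
    cofactor = (-1) ** (u + v) * (det(minor) if minor else Fraction(1))
    return Fraction(top, bot) * cofactor


def brute_force_counts(s, n):
    """Group all sequences (a_1,...,a_{n+1}) by (transition count F, first state, last state)."""
    counts = {}
    for seq in product(range(s), repeat=n + 1):
        F = tuple(tuple(sum(1 for m in range(n) if seq[m] == i and seq[m + 1] == j)
                        for j in range(s)) for i in range(s))
        key = (F, seq[0], seq[-1])
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    s, n = 3, 6   # illustrative sizes; 3^7 = 2187 sequences in total
    for (F, u, v), count in brute_force_counts(s, n).items():
        assert whittle_count([list(row) for row in F], u, v) == count
    print("Eq. (11.9) reproduces every brute-force count for s = 3, n = 6.")
```

Exact rational arithmetic (`fractions.Fraction`) is used for $F^{*}$ and its cofactor so that the comparison with the integer sequence counts is exact.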
Problem. Prove that if $P$ is stochastic, then any power of $P$ is also stochastic.
The entropy of the transitions (i.e., the weighted variety of the transitions) can be found from each row of the stochastic matrix according to Eq. (6.5). The (informational) entropy of the process as a whole is then the weighted average of these entropies, the weighting being given by the equilibrium distribution of the states. Hence, in a sense, the entropy of a Markov process is an average of averages.
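To make this "average of averages" concrete, here is a minimal sketch (Python; not from the text) that computes the equilibrium distribution of a stochastic matrix as the left eigenvector belonging to eigenvalue 1 and then forms the weighted average of the row entropies. The example matrix and the use of base-2 logarithms (bits) standing in for the convention of Eq. (6.5) are assumptions for illustration.

```python
# Minimal sketch (assumed, not from the text): the entropy of a Markov process
# as the equilibrium-weighted average of the row entropies of its stochastic
# matrix.  Base-2 logarithms (bits) are an assumed convention standing in for
# Eq. (6.5); the matrix P below is an arbitrary two-state example.
import numpy as np


def stationary_distribution(P):
    """Equilibrium distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return pi / pi.sum()


def row_entropy(row):
    """Entropy -sum_j p_j log2 p_j of a single row of the stochastic matrix."""
    p = row[row > 0]
    return float(-np.sum(p * np.log2(p)))


def process_entropy(P):
    """Weighted average of the row entropies, the weights being the equilibrium distribution."""
    pi = stationary_distribution(P)
    return float(sum(pi[i] * row_entropy(P[i]) for i in range(len(P))))


if __name__ == "__main__":
    P = np.array([[0.9, 0.1],       # an arbitrary example, not the chain of the
                  [0.2, 0.8]])      # problem below
    print("equilibrium distribution:", stationary_distribution(P))
    print("entropy of the process (bits per step):", process_entropy(P))
```

The same two functions apply unchanged to the three-state chain of the following problem.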
Problem. Consider the three-state Markov chain
$$
\begin{array}{c|ccc}
\rightarrow & 1 & 2 & 3 \\
\hline
1 & 0.1 & 0.9 & 0.0 \\
2 & 0.5 & 0.0 & 0.5 \\
3 & 0.3 & 0.3 & 0.4
\end{array}
$$
and calculate (i) the equilibrium proportions of the states 1, 2, and 3 and (ii) the
average entropy of the entire process.
8 See Billingsley, especially for the proof of Whittle’s formula, Eq. (11.9).